SIMARG DATASET


Index

1.

PRIMARY BASIC ANALYSIS OF DATA

Here we study the type of data we are working with, compute some basic statistics about it, and get a better understanding of our working environment.

Index

1.1

Basic stats

How many rows and how many columns are there in the data?

What are the names and datatypes in each column?

What percentage of the rows are labelled SYN Scan - aggressive?

As we can see, the label is not equally distributed, so we decided to use an oversampling technique to balance our data.

1.2

General data for every feature

Index

With the simple command below we can inspect every possible value of every feature. We notice that some features have very high cardinality, sometimes even equal to the number of rows. Those features will be dropped because they will not give any useful information in our classification problem.
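The cardinality check described above can be sketched as follows (a minimal example on a hypothetical toy frame; the real dataset and its column names differ):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumed name: df)
df = pd.DataFrame({
    "FLOW_ID": [1, 2, 3, 4],                       # unique per row: an "ID"-like column
    "PROTOCOL_MAP": ["tcp", "tcp", "udp", "tcp"],  # low cardinality, informative
    "IN_BYTES": [100, 250, 100, 3000],
})

# Cardinality of every feature: columns whose number of distinct values
# equals the number of rows carry no information for classification.
cardinality = df.nunique()
id_like = cardinality[cardinality == len(df)].index.tolist()
```

Columns collected in `id_like` are the candidates for dropping later in the cleaning step.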

1.3

Statistics for Features

Index

The two possible values that our LABEL feature can assume are Normal Flow and SYN Scan - aggressive. The first describes a normal connection; the second identifies a potentially dangerous connection.

In our primary analysis it can be useful to separate the malicious connections from the normal ones.

Since the label is not balanced, we will analyze a portion of the rows equal to the number of Normal Flows (approximately 500k).

The features below can, at first sight, represent good candidate discriminants.

We can see how the flow duration differs between the two groups. It seems that normal connections last much longer than malicious ones.

The TCP maximum window size is also much larger in normal connections.

IN_BYTES represents the number of incoming bytes of the connection. In the normal connections the number of bytes is ten times that of the malicious ones.

An interesting analysis can be made on the protocol map feature, which identifies the protocol used (TCP or UDP).

No connection using UDP is listed as malicious, so we can treat this feature as a discriminant for the classification problem.

All malicious connections use the TCP protocol.

2.

DATA CLEANING

This part focuses on cleaning the data and dropping features that we consider not useful for our problem. We will use several criteria, explained below.

Index

2.1

LOW-HIGH VARIABILITY: looking at the dataset, we noticed that some categorical features were composed mainly of a single category, so we deleted the categorical columns where the frequency of one category was greater than 99% (covering almost all the values of the column). We also removed columns where the most frequent category appeared in less than 0.0001% of the rows, because that meant the value was different for almost every row, so we considered it a meaningless "ID"-like column.
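The two frequency rules above could be implemented along these lines (a sketch on hypothetical column names, not the notebook's exact code):

```python
import pandas as pd

# Toy categorical frame illustrating the two rules
df = pd.DataFrame({
    "ALMOST_CONSTANT": ["a"] * 995 + ["b"] * 5,     # one category covers > 99%
    "PROTOCOL_MAP": ["tcp"] * 600 + ["udp"] * 400,  # informative, kept
})

to_drop = []
for col in df.columns:
    top_freq = df[col].value_counts(normalize=True).iloc[0]
    # Rule 1: quasi-constant column (dominant category above 99%)
    # Rule 2: ID-like column (even the most frequent value is essentially unique)
    if top_freq > 0.99 or top_freq < 0.000001:
        to_drop.append(col)

df = df.drop(columns=to_drop)
```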

Index

2.2

Definition of which columns are numerical and categorical

Index

2.3

Index

ENCODING: our dataset contains some categorical features with very high cardinality. Since Label Encoding would have had some problems with those features, we decided to apply Frequency Encoding.

One-hot encoding was not taken into consideration due to its computational cost on high-cardinality features.
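Frequency encoding replaces each category with its relative frequency in the column. A minimal sketch (the column name here is hypothetical):

```python
import pandas as pd

# Hypothetical high-cardinality categorical column
df = pd.DataFrame({"L7_PROTO_NAME": ["http", "http", "dns", "ssh"]})

# Map every category to its relative frequency in the column
freq = df["L7_PROTO_NAME"].value_counts(normalize=True)
df["L7_PROTO_NAME_FE"] = df["L7_PROTO_NAME"].map(freq)
```

Unlike one-hot encoding, this adds a single numeric column regardless of how many distinct categories the feature has.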

2.4

Index

OVERSAMPLING: since we noticed that the "LABEL" column is unbalanced, we decided to oversample the minority class until both classes of "LABEL" have the same number of instances.

The algorithm used for oversampling is SMOTE.

SMOTE is an oversampling technique in which synthetic samples are generated for the minority class. It works in feature space, creating new instances by interpolating between minority-class instances that lie close together.
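The notebook uses an off-the-shelf SMOTE implementation; the interpolation idea itself can be sketched in plain NumPy as follows (a simplified illustration, not the library code: real SMOTE uses a proper k-nearest-neighbours search and per-class sampling ratios):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny minority class in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def smote_sketch(X, n_new, k=2, rng=rng):
    """Generate n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)     # distances to the other points
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # random position on the segment
        out.append(X[i] + gap * (X[j] - X[i]))
    return np.array(out)

synthetic = smote_sketch(minority, n_new=4)
```

Each synthetic point lies on a segment between two existing minority points, which is exactly the "interpolation between instances that lie together" described above.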

2.5

Index

NORMALIZATION: here we normalized all the columns to bring the values in the dataset to a common scale, without distorting differences in the ranges of values. We took this step to avoid bias effects caused by features with very different value ranges.
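One common way to do this is min-max scaling, which maps every column to [0, 1] (a sketch; the exact scaler used in the notebook is not stated):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (e.g. flow duration vs. byte counts)
X = np.array([[1.0, 10_000.0],
              [2.0, 50_000.0],
              [3.0, 90_000.0]])

# Each column is mapped independently to [0, 1],
# preserving the relative differences within every feature
X_scaled = MinMaxScaler().fit_transform(X)
```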

3.

Index

ISOLATION FOREST TO REMOVE OUTLIERS:

The next cell performs multivariate outlier detection with Isolation Forest. Once detected, the outliers are removed.

Isolation Forest is a tree ensemble, similar in construction to a Random Forest classifier. It uses the contamination ratio as the threshold for deciding which points are considered anomalous.

After finding the possible outliers, we discarded them from our dataset.
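A minimal sketch of this step on synthetic data (the contamination value is illustrative, not the one used in the notebook):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 inliers around the origin plus two obvious planted outliers
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               np.array([[10.0, 10.0], [-12.0, 9.0]])])

# contamination is the expected fraction of outliers in the data
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)      # -1 = outlier, 1 = inlier

X_clean = X[labels == 1]         # discard the detected outliers
```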

4.

FEATURE SELECTION WITH CHI-SQUARE

Index

In feature selection, the chi-square test selects the features that are most dependent on the label.

In this section we detect the most important features using the chi2 criterion. We identify the features whose importance is greater than the median of all the feature importances.

4.1

Index

In the cell below we plot all the feature importances based on the chi2 criterion.

Here we keep only the features whose importance is greater than the median of all feature importances. Once we have the list of the most important features, we create a dataframe containing only those features.
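The selection rule can be sketched like this (using a synthetic stand-in for the encoded, normalized dataset; note that chi2 requires non-negative features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2

# Synthetic stand-in for the cleaned dataset
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X = X - X.min(axis=0)            # shift so all features are non-negative

scores, _ = chi2(X, y)           # one chi2 importance score per feature

# Keep only the features scoring above the median importance
keep = scores > np.median(scores)
X_selected = X[:, keep]
```

With a strict "greater than the median" rule, roughly half of the features survive.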


5.

SUPERVISED LEARNING

Index

5.1

Adaboost

Index

Adaboost Classifier

The most important parameters are base_estimator, n_estimators and learning_rate.
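A minimal usage sketch (parameter values are illustrative; note that recent scikit-learn versions rename base_estimator to estimator, so it is omitted here and the default base learner is used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected-feature dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators: number of boosting rounds;
# learning_rate: weight applied to each weak learner at every round
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```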

The advantages are as follows:

AdaBoost is easy to implement.

The disadvantages are as follows:

AdaBoost is sensitive to noisy data and outliers.

5.2

RANDOM FOREST

Index

5.3

XGBoost

Index

5.4

Index

MLPC

Multi-layer Perceptron classifier

This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
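A minimal sketch of fitting an MLPClassifier (the layer sizes and solver choice here are illustrative, not the notebook's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = MinMaxScaler().fit_transform(X)   # MLPs are sensitive to feature scale

# solver="lbfgs" converges quickly on small datasets;
# for large data, stochastic gradient descent ("adam"/"sgd") is preferred
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), solver="lbfgs",
                    max_iter=500, random_state=0)
mlp.fit(X, y)
train_acc = mlp.score(X, y)
```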

5.5

Index

CLASSIFICATION RESULTS

Comments:

Our models achieve nearly perfect scores. Even algorithms that are resistant to overfitting, such as Random Forest, perform extremely well, so we can rule out overfitting and take our results at face value.

As our classification problem is about intrusion detection, our goal is to detect dangerous connections as quickly as possible. Hence we suggest using the fastest model, so that it performs well even over short time intervals.

6.

UNSUPERVISED LEARNING

6.0

Index

Data preparation with PCA
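A sketch of the PCA reduction step (the 95% variance threshold is an illustrative assumption, not necessarily the notebook's choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated synthetic features: much of the variance lies along one direction
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=500)

# Keep enough components to explain 95% of the total variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Passing a float in (0, 1) to n_components tells scikit-learn to choose the smallest number of components that reaches that explained-variance ratio.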

6.1

Index

FuzzyKmeans

Converting the data frame into lists: the clustering implementation takes a list of lists rather than a data frame as input, so we first convert the data into a list of lists.

Fuzzy c-means (FCM) is a clustering method that allows one piece of data to belong to two or more clusters. The process is not far from the hard clustering performed by KMeans: it works with centroids and recalculates them until convergence.
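The centroid/membership updates can be sketched in plain NumPy (a simplified illustration of the standard FCM update rules, not the library implementation used in the notebook):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: soft memberships instead of hard assignments."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m                              # fuzzified memberships
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                 # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True) # standard FCM membership update
    return centroids, U

# Two hypothetical well-separated groups of points
X = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
               np.random.default_rng(2).normal(5, 0.2, (20, 2))])
centroids, U = fuzzy_cmeans(X, c=2)
hard_labels = U.argmax(axis=1)   # collapsing memberships gives KMeans-like labels
```

The fuzzifier m controls how soft the memberships are; m close to 1 approaches hard KMeans behaviour.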

6.1.0

Index

KMeans k=2
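A minimal sketch of the k=2 clustering on synthetic 2-D data standing in for the prepared flows:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D stand-in for the PCA-reduced flows
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(4, 0.5, (100, 2))])

# Hard clustering with two clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```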

Let's plot some pairs of features:

Let's analyze some of the data produced by clustering with k=2.

In Cluster_1, the TCP protocol is used more than in Cluster_0. As we saw from the raw data, the malicious connections only use TCP. UDP is equally distributed between the two clusters; unfortunately, this is not what we would expect, since we know that UDP is used only in normal connections.

6.1.1

Index

Fuzzy clustering with best k = 6

Analyzing the 6 resulting clusters:

We can identify two categories of cluster among the 6 produced by the clustering. According to connection duration, Cluster 3 and Cluster 5 have longer connections than the other clusters. It could be interesting to find out the percentage of malicious connections in those two.

Here is an interesting result: the clusters with longer connection durations contain no malicious elements.

So we can assume that malicious connections last much less than normal ones. This feature can be a strong discriminant for a classification problem.

6.1.2

Index

Analyzing only the malicious connections:

As we can see, the malicious connections have a similar mean duration in every cluster analyzed. Let's see which other features can be observed to find characteristics among the intrusions only.

The only features that can be taken into consideration are L4_SRC_PORT and L4_DST_PORT, which describe the source port and the destination port respectively.

It seems that in Cluster_3 there is only one Source port.

After label encoding, the most common ports are 5, 6, 7 and 8. Let's check the percentage of those ports in the different clusters.

The Destination port is not a good discriminant between different clusters.

PCA, however, proves its utility in finding components useful for clustering.

The source port turns out to be a good discriminant.

If we look at the port encoded as 66, 100% of its occurrences are in Cluster_3, which is composed only of connections coming from that port.

Cluster_4 and Cluster_5 have almost no connections from port 66.

Maybe a certain type of malicious connection comes from that port.

6.2

Index

Agglomerative Clustering

According to the dendrogram, performing a clustering with k=2 seems reasonable.
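The dendrogram-then-cluster workflow can be sketched as follows (synthetic data; the scipy linkage matrix is what a dendrogram plot would be built from):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two hypothetical well-separated groups
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# Ward linkage matrix; plotting it with scipy's dendrogram() is what
# suggests where to cut the tree (here, k=2)
Z = linkage(X, method="ward")
dendro_labels = fcluster(Z, t=2, criterion="maxclust")

# The same clustering via scikit-learn
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
```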

As we can see, the results are not far from those of our Fuzzy KMeans.

7.

Index

SUPERLEARNER

The SuperLearner is an ensemble algorithm that fits several different models, performing k-fold cross-validation on each of them, and then combines them through a meta-model fitted on their out-of-fold predictions.

The procedure can be summarized as follows:

1. Select a k-fold split of the training dataset.
2. Select m base-models or model configurations.
3. For each base-model:
    a. Evaluate using k-fold cross-validation.
    b. Store all out-of-fold predictions.
    c. Fit the model on the full training dataset and store.
4. Fit a meta-model on the out-of-fold predictions.
5. Evaluate the model on a holdout dataset or use model to make predictions.
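The steps above can be sketched with plain scikit-learn (the base models and meta-model here are illustrative choices, not necessarily the ones used in the notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: a train/holdout split (cross-validation happens inside the train set)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, random_state=0)

# Steps 2-3a/3b: base models and their out-of-fold predicted probabilities
base_models = [DecisionTreeClassifier(random_state=0),
               RandomForestClassifier(n_estimators=50, random_state=0)]
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
for m in base_models:            # step 3c: refit each model on the full train set
    m.fit(X_tr, y_tr)

# Step 4: meta-model fitted on the out-of-fold predictions
meta = LogisticRegression().fit(oof, y_tr)

# Step 5: evaluate on the holdout set
meta_features = np.column_stack([m.predict_proba(X_ho)[:, 1]
                                 for m in base_models])
holdout_acc = meta.score(meta_features, y_ho)
```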

The final result should be no worse than the best single model evaluated during k-fold cross-validation, and will often perform better than any single model.

The SuperLearner results are in line with the previous ones: the best predictor again gives 100% accuracy, precision, recall and F1 score.